Converting Gene Symbol Aliases to the Current Symbol for Any Species


2023-03-27 19:12

Requirement

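During data analysis, a gene of interest is often missing from the expression matrix. Two likely reasons are that the gene was not detected, or that the matrix lists it under an outdated Symbol. To handle the second case, you need the mapping between outdated Symbols (aliases) and the current Symbol.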

bq.tl.current_symbol updates the Symbols in an (expression) matrix to their current version.

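It takes three arguments: frame, a data frame whose index holds the Symbols; reference, the path to the Symbol-to-Alias mapping file; and tax_id, the species taxonomy ID, for example 9606 for human.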
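To make the idea concrete, here is a minimal sketch of how an alias-to-current-Symbol lookup could work in plain pandas. It is not the bioquest implementation: the function name to_current_symbol, the in-memory reference table, and the rule of skipping aliases that point to more than one Symbol are all assumptions made for illustration.

```python
import pandas as pd


def to_current_symbol(frame: pd.DataFrame, reference: pd.DataFrame, tax_id: int) -> pd.DataFrame:
    """Hypothetical stand-in illustrating the idea behind an alias lookup."""
    ref = reference[reference["tax_id"] == tax_id]
    current = set(ref["Symbol"])

    # Keep only aliases that point to exactly one current Symbol (assumed policy).
    pairs = ref.loc[ref["Alias"] != "", ["Alias", "Symbol"]].drop_duplicates()
    pairs = pairs[~pairs["Alias"].duplicated(keep=False)]
    alias_to_symbol = dict(zip(pairs["Alias"], pairs["Symbol"]))

    new_index, old_names = [], []
    for name in frame.index:
        if name not in current and name in alias_to_symbol:
            new_index.append(alias_to_symbol[name])  # outdated Symbol: rename
            old_names.append(name)
        else:
            new_index.append(name)  # already current, or unknown: keep as-is
            old_names.append(None)

    out = frame.copy()
    out.index = pd.Index(new_index, name=frame.index.name)
    out["Alias"] = old_names  # records the old name for renamed rows only
    return out


# Toy reference table; only the columns used by the sketch are included.
reference = pd.DataFrame({
    "tax_id": [9606, 9606, 9606, 10090],
    "Symbol": ["H2BC1", "MACROH2A2", "UBB", "H2bc1"],
    "Alias": ["HIST1H2BA", "H2AFY2", "", "Hist1h2ba"],
})
expr = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=["HIST1H2BA", "H2AFY2", "UBB"])

converted = to_current_symbol(expr, reference, tax_id=9606)
print(converted)
```

Filtering by tax_id first matters because the same alias string can belong to different genes in different species.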

To obtain SymbolAlias_20230317.feather, send an email to [email protected]

Alternatively, download the latest gene information from NCBI: https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
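The download can also be scripted. The sketch below uses only the Python standard library and saves the file under a date-stamped name. The helper names dated_name and download_gene_info are hypothetical, and the naming pattern is an assumption based on the file name gene_info_20230317.gz used in this post.

```python
import shutil
import urllib.request
from datetime import date
from pathlib import Path
from typing import Optional

GENE_INFO_URL = "https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz"


def dated_name(day: date) -> str:
    """Build a date-stamped file name such as gene_info_20230317.gz."""
    return f"gene_info_{day:%Y%m%d}.gz"


def download_gene_info(dest_dir: str = ".", day: Optional[date] = None) -> Path:
    """Stream the NCBI file to disk without loading it into memory."""
    target = Path(dest_dir) / dated_name(day or date.today())
    with urllib.request.urlopen(GENE_INFO_URL) as response, open(target, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return target


# The file covers all species and is large, so the call is left commented out:
# path = download_gene_info()
```

Keeping the date in the file name makes it easy to tell which NCBI snapshot a given mapping table was built from.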

    import numpy as np
    import pandas as pd
    import bioquest as bq

Building the Symbol-to-Alias mapping

    # Read only the columns needed from the NCBI gene_info file
    g = pd.read_csv("gene_info_20230317.gz", sep='\t', usecols=['#tax_id', 'GeneID', 'Symbol', 'Synonyms'])
    g.rename(columns={"#tax_id": "tax_id"}, inplace=True)
    # Synonyms is a '|'-separated string; split it and give each alias its own row
    g.loc[:, "Alias"] = g.Synonyms.str.split('|')
    g = g.explode("Alias")
    g = bq.tl.select(g, columns=["tax_id", "GeneID", "Symbol", "Alias"])
    g.reset_index(drop=True, inplace=True)
    # NCBI writes '-' when a gene has no synonyms; store an empty string instead
    g.replace({'Alias': {'-': ''}}, inplace=True)
    g.to_feather("SymbolAlias_20230317.feather", compression='zstd', compression_level=1)

The resulting table g:

              tax_id    GeneID       Symbol Alias
    0              7   5692769     NEWENTRY
    1              9   2827857     NEWENTRY
    2             11  10823747     NEWENTRY
    3             14   6951813     NEWENTRY
    4             19   3758873     NEWENTRY
    ...          ...       ...          ... ...
    44205723 3032134  60460443          ND6
    44205724 3032134  60460444          ND1
    44205725 3032134  60460445  I9997_mgr02
    44205726 3032134  60460446  I9997_mgt22
    44205727 3032134  60460447  I9997_mgr01

    [44205728 rows x 4 columns]

Usage example

Example data

    df = pd.read_csv("BLCA.csv", index_col="Gene Symbol")
    #                                                      Gene Name       Species
    # Gene Symbol
    # ATP2B1            ATPase, Ca++ transporting, plasma membrane 1  Homo sapiens
    # MYL6         myosin, light chain 6, alkali, smooth muscle a...  Homo sapiens
    # RPS16                                    ribosomal protein S16  Homo sapiens
    # HIST1H2BA                              histone cluster 1, H2ba  Homo sapiens
    # H2AFY2                           H2A histone family, member Y2  Homo sapiens
    # ...                                                        ...           ...
    # UBB                                                ubiquitin B  Homo sapiens
    # PYGB                            phosphorylase, glycogen; brain  Homo sapiens
    # HLA-A             major histocompatibility complex, class I, A  Homo sapiens
    # HSPA1A                             heat shock 70kDa protein 1A  Homo sapiens
    # HSP90AB1     heat shock protein 90kDa alpha (cytosolic), cl...  Homo sapiens
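Before converting, it can help to check which index entries are already current, which are known aliases, and which are not found at all. The sketch below does this on toy data: it builds a small alias table with the same split-and-explode approach used for the full NCBI file, then labels each Symbol. The Synonyms values are illustrative rather than copied from NCBI, and unlike the code above the sketch drops '-' rows instead of replacing them with an empty string.

```python
import pandas as pd

# Toy stand-in for a few gene_info rows ('|'-separated synonyms, '-' means none).
gene_info = pd.DataFrame({
    "tax_id": [9606, 9606, 9606],
    "Symbol": ["H2BC1", "DARS1", "UBB"],
    "Synonyms": ["HIST1H2BA|TSH2B", "DARS|HBSL", "-"],  # illustrative values
})

# One row per (Symbol, Alias) pair.
alias_table = gene_info.assign(Alias=gene_info["Synonyms"].str.split("|")).explode("Alias")
alias_table = alias_table[alias_table["Alias"] != "-"].reset_index(drop=True)

current = set(gene_info["Symbol"])
aliases = set(alias_table["Alias"])

# Index of a hypothetical expression matrix.
index = pd.Index(["HIST1H2BA", "DARS", "UBB", "NOT_A_GENE"], name="Gene Symbol")

status = pd.Series(
    ["current" if s in current else "alias" if s in aliases else "unknown" for s in index],
    index=index,
    name="status",
)
print(status)
```

Entries labelled "unknown" are worth inspecting by hand: they may be typos, identifiers from another naming scheme, or genes from a different species.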

Conversion

    bq.tl.current_symbol(frame=df, reference="SymbolAlias_20230317.feather", tax_id=9606)
    #                                                   Gene Name       Species  \
    # H2BC1                                histone cluster 1, H2ba  Homo sapiens
    # MACROH2A2                      H2A histone family, member Y2  Homo sapiens
    # H3-3B                          H3 histone, family 3B (H3.3B)  Homo sapiens
    # H1-5                                  histone cluster 1, H1b  Homo sapiens
    # DARS1                               aspartyl-tRNA synthetase  Homo sapiens
    # ...                                                      ...           ...
    # UBB                                              ubiquitin B  Homo sapiens
    # PYGB                          phosphorylase, glycogen; brain  Homo sapiens
    # HLA-A           major histocompatibility complex, class I, A  Homo sapiens
    # HSPA1A                           heat shock 70kDa protein 1A  Homo sapiens
    # HSP90AB1   heat shock protein 90kDa alpha (cytosolic), cl...  Homo sapiens

    #                Alias
    # H2BC1      HIST1H2BA
    # MACROH2A2     H2AFY2
    # H3-3B          H3F3B
    # H1-5        HIST1H1B
    # DARS1           DARS
    # ...              ...
    # UBB              NaN
    # PYGB             NaN
    # HLA-A            NaN
    # HSPA1A           NaN
    # HSP90AB1         NaN

    # [378 rows x 3 columns]

Rows whose Symbol was updated carry the old name in the new Alias column; rows that were already current show NaN there.
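After conversion, two checks are often useful: listing the rows that were actually renamed, and looking for duplicate index entries, which can appear when an outdated Symbol and its current Symbol were both present in the input. The sketch below runs both checks on a toy frame shaped like the output above. Whether bq.tl.current_symbol already handles duplicates is not stated here, and keeping the first occurrence is just one possible policy.

```python
import pandas as pd

# Toy frame shaped like the converted output: an Alias column that is
# filled only for rows whose Symbol was updated.
converted = pd.DataFrame(
    {"value": [1.0, 3.0, 5.0], "Alias": ["HIST1H2BA", None, None]},
    index=pd.Index(["H2BC1", "H2BC1", "UBB"], name="Gene Symbol"),
)

# Rows that were renamed during conversion.
renamed = converted[converted["Alias"].notna()]

# All rows whose Symbol now occurs more than once.
duplicates = converted[converted.index.duplicated(keep=False)]

# One possible policy (an assumption): keep the first row per Symbol.
deduplicated = converted[~converted.index.duplicated(keep="first")]

print(renamed)
print(duplicates)
print(deduplicated)
```

For expression matrices, aggregating duplicate rows (for example by mean or maximum) may be more appropriate than keeping the first one; the right choice depends on the downstream analysis.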


